pacman::p_load(jsonlite, tidygraph, ggraph,
visNetwork, graphlayouts, ggforce,
skimr, tidytext, tidyverse, caret, igraph, wordcloud)Take-home_Ex03
Getting Started
The code chunk below will be used to install and load the necessary R packages to meet the data preparation, data wrangling, data analysis and visualisation needs.
Data Import
In the code chunk below, fromJSON() of jsonlite package is used to import MC3.json into R environment.
mc3_data <- fromJSON("data/MC3.json")The output is called mc3_data. It is a large list R object.
Extracting edges
The code chunk below will be used to extract the links data.frame of mc3_data and save it as a tibble data.frame called mc3_edges.
mc3_edges <- as_tibble(mc3_data$links) %>%
distinct() %>%
mutate(source = as.character(source),
target = as.character(target),
type = as.character(type)) %>%
group_by(source, target, type) %>%
summarise(weights = n()) %>%
filter(source!=target) %>%
ungroup()`summarise()` has grouped output by 'source', 'target'. You can override using
the `.groups` argument.
Extracting nodes
The code chunk below will be used to extract the nodes data.frame of mc3_data and save it as a tibble data.frame called mc3_nodes.
mc3_nodes <- as_tibble(mc3_data$nodes) %>%
mutate(country = as.character(country),
id = as.character(id),
product_services = as.character(product_services),
revenue_omu = as.numeric(as.character(revenue_omu)),
type = as.character(type)) %>%
select(id, country, type, revenue_omu, product_services)Warning: There was 1 warning in `mutate()`.
ℹ In argument: `revenue_omu = as.numeric(as.character(revenue_omu))`.
Caused by warning:
! NAs introduced by coercion
Initial Data Exploration
Exploring the edges data frame
In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_edges tibble data frame.
skim(mc3_edges)| Name | mc3_edges |
| Number of rows | 24036 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
The report above reveals that there is not missing values in all fields.
In the code chunk below, datatable() of DT package is used to display mc3_edges tibble data frame as an interactive table on the html document.
DT::datatable(mc3_edges)Warning in instance$preRenderHook(instance): It seems your data is too
big for client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
Initial Network Visualisation and Analysis
Building network model with tidygraph
id1 <- mc3_edges %>%
select(source) %>%
rename(id = source)
id2 <- mc3_edges %>%
select(target) %>%
rename(id = target)
mc3_nodes1 <- rbind(id1, id2) %>%
distinct() %>%
left_join(mc3_nodes,
unmatched = "drop")Joining with `by = join_by(id)`
mc3_graph <- tbl_graph(nodes = mc3_nodes1,
edges = mc3_edges,
directed = FALSE) %>%
mutate(betweenness_centrality = centrality_betweenness(),
closeness_centrality = centrality_closeness())mc3_graph %>%
filter(betweenness_centrality >= 100000) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha=0.5)) +
geom_node_point(aes(
size = betweenness_centrality,
colors = "lightblue",
alpha = 0.5)) +
scale_size_continuous(range=c(1,10))+
theme_graph()Warning in geom_node_point(aes(size = betweenness_centrality, colors =
"lightblue", : Ignoring unknown aesthetics: colours
Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
found in Windows font database
Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database

Exploring the nodes data frame
In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_nodes tibble data frame.
skim(mc3_nodes)| Name | mc3_nodes |
| Number of rows | 27622 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
The report above reveals that there is no missing values in all fields.
In the code chunk below, datatable() of DT package is used to display mc3_nodes tibble data frame as an interactive table on the html document.
DT::datatable(mc3_nodes)Warning in instance$preRenderHook(instance): It seems your data is too
big for client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
Text Sensing with tidytext
In this section, you will learn how to perform basic text sensing using appropriate functions of tidytext package.
Simple word count
The code chunk below calculates number of times the word fish appeared in the field product_services.
mc3_nodes %>%
mutate(n_fish = str_count(product_services, "fish")) # A tibble: 27,622 × 6
id country type revenue_omu product_services n_fish
<chr> <chr> <chr> <dbl> <chr> <int>
1 Jones LLC ZH Comp… 310612303. Automobiles 0
2 Coleman, Hall and Lopez ZH Comp… 162734684. Passenger cars,… 0
3 Aqua Advancements Sashimi … Oceanus Comp… 115004667. Holding firm wh… 0
4 Makumba Ltd. Liability Co Utopor… Comp… 90986413. Car service, ca… 0
5 Taylor, Taylor and Farrell ZH Comp… 81466667. Fully electric … 0
6 Harmon, Edwards and Bates ZH Comp… 75070435. Discount superm… 0
7 Punjab s Marine conservati… Riodel… Comp… 72167572. Beef, pork, chi… 0
8 Assam Limited Liability … Utopor… Comp… 72162317. Power and Gas s… 0
9 Ianira Starfish Sagl Import Rio Is… Comp… 68832979. Light commercia… 0
10 Moran, Lewis and Jimenez ZH Comp… 65592906. Automobiles, tr… 0
# ℹ 27,612 more rows
Tokenisation
The word tokenisation have different meaning in different scientific domains. In text sensing, tokenisation is the process of breaking up a given text into units called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenisation, some characters like punctuation marks may be discarded. The tokens usually become the input for the processes like parsing and text mining.
In the code chunk below, unnest_token() of tidytext is used to split text in product_services field into words.
token_nodes <- mc3_nodes %>%
unnest_tokens(word,
product_services)The two basic arguments to unnest_tokens() used here are column names. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (product_services, in this case).
Now we can visualise the words extracted by using the code chunk below.
token_nodes %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "Count",
y = "Unique words",
title = "Count of unique words found in product_services field")Selecting by n

The bar chart reveals that the unique words contains some words that may not be useful to use. For instance “a” and “to”. In the word of text mining we call those words stop words. You want to remove these words from your analysis as they are fillers used to compose a sentence.
Removing stopwords
Lucky for use, the tidytext package has a function called stop_words that will help us clean up stop words.
Let’s give this a try next!
stopwords_removed <- token_nodes %>%
anti_join(stop_words)Joining with `by = join_by(word)`
Now we can visualise the words extracted by using the code chunk below.
stopwords_removed %>%
count(word, sort = TRUE) %>%
top_n(15) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col() +
xlab(NULL) +
coord_flip() +
labs(x = "Count",
y = "Unique words",
title = "Count of unique words found in product_services field")Selecting by n

Distribution Analysis
Here is the distribution of two different types in edges.
ggplot(data = mc3_edges,
aes(x = type)) +
geom_bar() +
labs(title = "Type variable in Edges") # Add the plot title
Here is the distribution of three different types in nodes.
ggplot(data = mc3_nodes,
aes(x = type)) +
geom_bar() +
labs(title = "Type variable in Nodes") # Add the plot title
Here is the boxplot graph of revenue distribution across nodes type.
As can be seen from the graph, “Benefical Owner” has the highest median of revenue. “Company” has lots of outliers but with the lowest median of revenue. For “Company Contracts”, this category has the least variation, with the median value at around 100,000.
ggplot(data = mc3_nodes, aes(x = type, y = revenue_omu)) +
geom_boxplot() +
scale_y_log10() + # Apply logarithmic scale to the y-axis
ylab("Revenue") +
ggtitle("Distribution of Revenue by Node Type")Warning: Removed 21515 rows containing non-finite values (`stat_boxplot()`).

Text Visualization
This part use wordcloud library is called with the filtered words and their frequencies. The words are extracted from “product_services” to find which type of products and services are the most transported among all the goods categories. As can be seen from the graph, the words that appear at a high frequency are: fish, seafood, frozen, salmon and fresh.
library(tidytext)
library(wordcloud)
# Create a tidy text data frame from the product_services column
tidy_data <- mc3_nodes %>%
select(product_services) %>%
unnest_tokens(word, product_services)
# Load the stopwords dataset
data(stop_words)
# Filter out stopwords and common meaningless words
filtered_data <- tidy_data %>%
anti_join(stop_words) %>%
filter(!word %in% c("0", "other", "as", "a", "such", "for", "as", "the", "related", "unknown", "character"))Joining with `by = join_by(word)`
# Compute word frequencies
word_freq <- filtered_data %>%
count(word)
# Filter words based on frequency
min_freq <- 10
max_words <- 200
filtered_words <- word_freq %>%
filter(n >= min_freq) %>%
top_n(max_words, wt = n) %>%
pull(word)
# Generate the word cloud
word_freq_filtered <- word_freq %>%
filter(word %in% filtered_words)
# Define colors for the word cloud (darker colors)
colors <- brewer.pal(length(filtered_words), "Dark2")Warning in brewer.pal(length(filtered_words), "Dark2"): n too large, allowed maximum for palette Dark2 is 8
Returning the palette you asked for with that many colors
wordcloud(
words = word_freq_filtered$word,
freq = word_freq_filtered$n,
colors = colors
)
Network Visualization
The following network graph shows the nodes and edges that the centrality_betweenness score is larger than 100,000 and degree is above 3.
library(igraph)
library(visNetwork)
# Convert the edges graph to a tibble
edges_df <- mc3_graph %>%
activate(edges) %>%
as_tibble()
# Convert the nodes graph to a tibble
nodes_df <- mc3_graph %>%
activate(nodes) %>%
as_tibble() %>%
rename(label = id) %>%
mutate(id = row_number()) %>%
select(id, label)
# Perform community detection using the Louvain algorithm on the graph
communities <- cluster_louvain(mc3_graph)
# Get the cluster membership of each node
membership <- membership(communities)
# Add the cluster membership information to the nodes data frame
nodes_df$group <- membershipmc3_graph <- tbl_graph(nodes = mc3_nodes1,
edges = mc3_edges,
directed = FALSE) %>%
mutate(betweenness_centrality = centrality_betweenness(),
closeness_centrality = as.factor(centrality_closeness())) %>%
filter(betweenness_centrality >= 100000)
# Calculate the degrees of each node
degrees <- degree(mc3_graph)
set.seed (1234)
mc3_graph %>%
# filter(betweenness_centrality >= 100000) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = 0.5)) +
geom_node_point(aes(size = betweenness_centrality,
color = closeness_centrality,
alpha = 0.5), show.legend = FALSE) +
geom_node_text(aes(label = ifelse(degrees > 3, as.character(id), "")), size = 2) + # Add node labels
scale_size_continuous(range = c(1, 10)) +
labs(title = "Network Visualization") + # Add the plot title
theme_graph()Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database

This code plots the graph of network visualization to show the nodes and edges of “beneficial owner” and “company”.
library(ggraph)
mc3_graph %>%
filter(betweenness_centrality >= 1000000) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = 0.5)) +
geom_node_point(aes(
size = betweenness_centrality,
color = type,
shape = type),
alpha = 0.8) +
scale_size_continuous(range = c(1, 6)) +
scale_color_manual(values = c("Company Contacts" = "yellow", "Beneficial Owner" = "red", "Company" = "blue")) +
scale_shape_manual(values = c("Company Contacts" = 17, "Beneficial Owner" = 15, "Company" = 16)) +
theme_graph() +
labs(title = "Network Visualization") +
theme(plot.title = element_text(hjust = 0.5))Warning: Removed 34 rows containing missing values (`geom_point()`).
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database

Group Detection
edges_df <- mc3_graph %>%
activate(edges) %>%
as.tibble()Warning: `as.tibble()` was deprecated in tibble 2.0.0.
ℹ Please use `as_tibble()` instead.
ℹ The signature and semantics have changed, see `?as_tibble`.
nodes_df <- mc3_graph %>%
activate(nodes) %>%
as.tibble() %>%
rename(label = id) %>%
mutate(id=row_number()) %>%
select(id, label)
# Perform community detection using the cluster edge betweenness
communities <- cluster_edge_betweenness(mc3_graph)
# Get the cluster membership of each node
membership <- membership(communities)
# Add the cluster membership information to the nodes data frame
nodes_df$group <- membership
# Plot the graph with clustered nodes using visNetwork
visNetwork(nodes_df, edges_df) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visEdges(arrows = "to",
smooth = list(enabled = TRUE,
type = "curvedCW"),
color = list(highlight = "lightgray")) %>%
visOptions(highlightNearest = list(enabled = TRUE,
degree = 1,
hover = TRUE,
labelOnly = TRUE),
nodesIdSelection = TRUE,
selectedBy = "group") %>%
visLayout(randomSeed = 1234)